LSH Ensemble: Internet-Scale Domain Search
Abstract
We study the problem of domain search, where a domain is a set of distinct values from an unspecified universe. We use the Jaccard set containment score, defined as |Q ∩ X|/|Q|, as the measure of relevance of a domain X to a query domain Q. Our choice of Jaccard set containment over Jaccard similarity as a measure of relevance makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarity is known to perform poorly over sets with large differences in their domain sizes. We demonstrate that the domains found in several real-life Open Data and web data repositories show a power-law distribution over their domain sizes. We present a new index structure, Locality Sensitive Hashing (LSH) Ensemble, that solves the domain search problem using set containment at Internet scale. Our index structure and search algorithm cope with the data volume and skew by means of data sketches using Minwise Hashing and domain partitioning. Our index structure does not assume a prescribed set of data values. We construct a cost model that describes the accuracy of LSH Ensemble with any given partitioning. This allows us to formulate the data partitioning for LSH Ensemble as an optimization problem. We prove that there exists an optimal partitioning for any data distribution. Furthermore, for datasets following a power-law distribution, as observed in Open Data and web data corpora, we show that the optimal partitioning can be approximated using equi-depth, making it particularly efficient to use in practice. We evaluate our algorithm using real data (Canadian Open Data and WDC Web Tables) containing over 262 million domains. The experiments demonstrate that our index consistently outperforms other leading alternatives in accuracy and performance. The improvements are most dramatic for data with large skew in the domain sizes. Even at 262 million domains, our index sustains query performance with under 3 seconds response time.
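The contrast between containment and Jaccard similarity on domains of very different sizes, and the idea of equi-depth partitioning by domain size, can be illustrated with a small sketch. This is not the LSH Ensemble index itself: it computes exact scores on toy sets and uses no Minwise Hashing. The function names (`jaccard_similarity`, `containment`, `equi_depth_partitions`) and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set


def jaccard_similarity(q: Set[str], x: Set[str]) -> float:
    """Exact Jaccard similarity |Q ∩ X| / |Q ∪ X| (symmetric)."""
    union = q | x
    return len(q & x) / len(union) if union else 0.0


def containment(q: Set[str], x: Set[str]) -> float:
    """Exact set containment |Q ∩ X| / |Q|: the share of Q covered by X."""
    return len(q & x) / len(q) if q else 0.0


def equi_depth_partitions(sizes: Iterable[int], k: int) -> List[List[int]]:
    """Sort domain sizes and cut them into k buckets of (nearly) equal count.

    Illustrative assumption: a partition is represented simply as the list
    of domain sizes that fall into it.
    """
    ordered = sorted(sizes)
    n = len(ordered)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]


# Toy data (hypothetical): a 3-value query, a 4-value domain,
# and a 1000-value domain that both fully contain the query.
query = {"ON", "QC", "BC"}
small = {"ON", "QC", "BC", "AB"}
large = small | {f"v{i}" for i in range(996)}

print(containment(query, small), jaccard_similarity(query, small))  # 1.0 0.75
print(containment(query, large), jaccard_similarity(query, large))  # 1.0 0.003

# Skewed domain sizes split into 4 buckets with 2 domains each.
print(equi_depth_partitions([8, 1, 1000, 2, 1, 3, 2, 5], k=4))
# [[1, 1], [2, 2], [3, 5], [8, 1000]]
```

In this toy case both candidate domains contain the whole query, so containment is 1.0 for each, while Jaccard similarity falls from 0.75 to 0.003 purely because the second domain is larger. The equi-depth split places the same number of domains in every bucket, so the buckets covering the long tail span a much wider range of sizes.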
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 9, No. 12. Copyright 2016 VLDB Endowment 2150-8097/16/08.

Country      Number of Datasets (Structured and Semi-Structured)
US           191,695
UK           26,153
Canada       244,885
Singapore    11,992

Table 1: Examples of Governmental Open Data as of First Quarter 2016.
Similar resources
Interactive Navigation of Open Data Linkages
We developed Toronto Open Data Search to support the ad hoc, interactive discovery of connections or linkages between datasets. It can be used to efficiently navigate through the open data cloud. Our system consists of three parts: a user-interface provided by a Web application; a scalable backend infrastructure that supports navigational queries; and a dynamic repository of open data tables. O...
NearBucket-LSH: Efficient Similarity Search in P2P Networks
We present NearBucket-LSH, an effective algorithm for similarity search in large-scale distributed online social networks organized as peer-to-peer overlays. As communication is a dominant consideration in distributed systems, we focus on minimizing the network cost while guaranteeing good search quality. Our algorithm is based on Locality Sensitive Hashing (LSH), which limits the search to col...
Learning to Search Efficiently in High Dimensions
High-dimensional similarity search in large-scale databases has become an important challenge with the advent of the Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means...
S2JSD-LSH: A Locality-Sensitive Hashing Schema for Probability Distributions
To compare the similarity of probability distributions, information-theoretically motivated metrics like Kullback-Leibler divergence (KL) and Jensen-Shannon divergence (JSD) are often more reasonable than metrics for vectors like Euclidean and angular distance. However, existing locality-sensitive hashing (LSH) algorithms cannot support the information-theoretically motivated metric...
SC-LSH: An Efficient Indexing Method for Approximate Similarity Search in High Dimensional Space
Locality Sensitive Hashing (LSH) is one of the most promising techniques for solving the nearest neighbour search problem in high-dimensional space. Euclidean LSH is the most popular variation of LSH and has been successfully applied in many multimedia applications. However, Euclidean LSH has limitations that affect structure and query performance. The main limitation of the Euclidean LS...
Journal title: PVLDB
Volume: 9
Publication date: 2016